Abstract

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance compared to the state-of-the-art methods. In this paper, we provide a novel regret analysis for Thompson Sampling that simultaneously proves both the optimal problem-dependent bound of $(1+\epsilon)\sum_i \frac{\ln T}{\Delta_i} + O(\frac{N}{\epsilon^2})$ and the first near-optimal problem-independent bound of $O(\sqrt{NT \ln T})$ on the expected regret of this algorithm. Our near-optimal problem-independent bound solves a COLT 2012 open problem of Chapelle and Li. The optimal problem-dependent regret bound for this problem was first proven recently by Kaufmann et al. [13]. Our novel martingale-based analysis techniques are conceptually simple, easily extend to distributions other than the Beta distribution, and also extend to the more general contextual bandits setting [2].

1 Introduction 

The multi-armed bandit (MAB) problem models the exploration/exploitation trade-off inherent in sequential decision
problems. One of the early motivations for studying the MAB problem was clinical trials: suppose that we have
N different treatments of unknown efficacy for a certain disease. Patients arrive sequentially, and we must 
decide on a treatment to administer for each arriving patient. To make this decision, we could learn from 
how the previous choices of treatments fared for the previous patients. After a sufficient number of trials, we 
may have a reasonable idea of which treatment is most effective, and from then on, we could administer that 
treatment for all the patients. However, initially, when there is no or very little information available, we 
need to explore and try each treatment a sufficient number of times. We wish to do this exploration in such a
way that we can find the best treatment and start exploiting it as soon as possible. The MAB problem is to
decide how to choose the treatment for the next patient, given the outcomes of the treatments so far. Today,
the multi-armed bandit problem has a diverse set of applications, some of which will be mentioned shortly.

Many versions and generalizations of the multi-armed bandit problem have been studied in the literature; 
in this paper we will consider a basic and well-studied version of this problem: the stochastic multi-armed 
bandit problem. Among many algorithms available for the stochastic bandit problem, some popular ones 
include the Upper Confidence Bound (UCB) family of algorithms (e.g., [Hill], and more recently [3l 151 [TBI
113]), which have good theoretical guarantees, and the algorithm by [9], which gives an optimal strategy under
the Bayesian setting with known priors and geometric time-discounted rewards. In one of the earliest works on
stochastic bandit problems, [22] proposed a natural randomized Bayesian algorithm to minimize regret. The 
basic idea is to assume a simple prior distribution on the parameters of the reward distribution of every 
arm, and at any time step, play an arm according to its posterior probability of being the best arm. This 
algorithm is known as Thompson Sampling (TS), and it is a member of the family of randomized probability 
matching algorithms. TS is a very natural algorithm and the same idea has been rediscovered many times 
independently in the context of reinforcement learning, e.g., in J53[ [1151 [H] . We emphasize that although 
the TS algorithm is a Bayesian approach, the description of the algorithm and our analysis apply to the prior-free stochastic multi-armed bandit model where parameters of the reward distribution of every arm are fixed, though unknown (see Section 1.1). One could interpret the "assumed" Bayesian priors as the current knowledge of the algorithm about the arms. Thus, our regret bounds for Thompson Sampling are directly comparable to the regret bounds for the UCB family of algorithms, which are a frequentist approach to the same problem.

Recently, TS has attracted considerable attention. Several studies (e.g., [TTJ HH] ) have empirically 
demonstrated the efficacy of Thompson Sampling: [20] provides a detailed discussion of probability matching
techniques in many general settings along with favorable empirical comparisons with other techniques. [6]
demonstrates that empirically TS achieves regret comparable to the lower bound of [15], and that in applications
like display advertising and news article recommendation, it is competitive to or better than popular methods 
such as UCB. In their experiments, TS is also more robust to delayed or batched feedback (delayed feedback 
means that the result of a play of an arm may become available only after some time delay, but we are 
required to make immediate decisions for which arm to play next) than the other methods. A possible 
explanation may be that TS is a randomized algorithm and so it is unlikely to get trapped in an early bad
decision during the delay. Microsoft's adPredictor ([10]) for CTR prediction of search ads on Bing uses the
idea of Thompson Sampling.

Despite being easy to implement, competitive with the state-of-the-art methods, and popular in practice, TS lacked a strong theoretical analysis. [11, 17] provide weak guarantees, namely, a bound of $o(T)$ on expected regret in time T. Significant progress was made in the recent work of [1] and [14]. In [1], the first logarithmic bound on the expected regret of the TS algorithm was proven. [13] provided a bound that matches the asymptotic lower bound of [15] for this problem. However, both these bounds were problem-dependent, i.e., the regret bounds are logarithmic in T when the problem parameters, namely the mean rewards for each arm and their differences, are assumed to be constants. The problem-independent bounds implied by these existing works were far from optimal. Obtaining a problem-independent bound that is close to the lower bound of $\Omega(\sqrt{NT})$ was also posed as an open problem by Chapelle and Li [7].

In this paper, we give a regret analysis for Thompson Sampling that provides both optimal problem- 
dependent and near-optimal problem-independent regret bounds for Thompson Sampling. Our novel martingale- 
based analysis technique is conceptually simple (arguably simpler than the previous work). Our technique 
easily extends to distributions other than Beta distribution, and it also extends to the more general con- 
textual bandits setting [2]. While the basic idea for the analysis in the contextual bandits setting of [2] is 
inspired by the idea in this paper, the details are substantially different. 

Before stating our results, we describe the MAB problem and the TS algorithm formally. 



1.1 The multi-armed bandit problem 

We consider the stochastic multi-armed bandit (MAB) problem: We are given a slot machine with N arms; 
at each time step t = 1, 2, 3, . . ., one of the N arms must be chosen to be played. Each arm i, when played,
yields a random real-valued reward according to some fixed (unknown) distribution with support in [0, 1].
The random rewards obtained from playing an arm repeatedly are i.i.d. and independent of the plays of the
other arms. The reward is observed immediately after playing the arm.

An algorithm for the MAB problem must decide which arm to play at each time step t, based on the
outcomes of the previous t - 1 plays. Let $\mu_i$ denote the (unknown) expected reward for arm i. A popular
goal is to maximize the expected total reward in time T, i.e., $E[\sum_{t=1}^{T} \mu_{i(t)}]$, where i(t) is the arm played in
step t, and the expectation is over the random choices of i(t) made by the algorithm. It is more convenient
to work with the equivalent measure of expected total regret: the amount we lose because of not playing
the optimal arm in each step. To formally define regret, let us introduce some notation. Let $\mu^* := \max_i \mu_i$, and
$\Delta_i := \mu^* - \mu_i$. Also, let $k_i(t)$ denote the number of times arm i has been played up to step t - 1. Then the
expected total regret in time T is given by

$$E[\mathcal{R}(T)] = E\left[\sum_{t=1}^{T} (\mu^* - \mu_{i(t)})\right] = \sum_i \Delta_i \cdot E[k_i(T+1)].$$
Other performance measures include PAC-style guarantees; we do not consider those measures here. 
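
As a small illustration of the regret decomposition above, the following sketch evaluates $\sum_i \Delta_i \cdot E[k_i(T+1)]$ from a vector of arm means and a vector of expected play counts. The function name `expected_regret` and the example numbers are hypothetical and are not taken from the paper.

```python
def expected_regret(means, expected_plays):
    """Return sum_i Delta_i * E[k_i(T+1)], with Delta_i = max(means) - means[i].

    `means` and `expected_plays` are illustrative inputs (assumed, not from the paper).
    """
    best = max(means)
    return sum((best - m) * k for m, k in zip(means, expected_plays))


# Three arms played 70, 20 and 10 times in expectation over T = 100 steps.
example = expected_regret([0.9, 0.8, 0.5], [70, 20, 10])
```

Only the suboptimal arms contribute: here the gaps are 0.1 and 0.4, so the regret is driven entirely by how often those two arms are played.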

1.2 Thompson Sampling

We provide the details of the Thompson Sampling algorithm and our analysis for the Bernoulli bandit problem, i.e., when the rewards are either 0 or 1, and for arm i the probability of success (reward = 1) is $\mu_i$. This description of Thompson Sampling follows closely that of [6]. A simple extension of this algorithm to general reward distributions with support [0, 1] is described in [1], which seamlessly extends our analysis for Bernoulli bandits to the general stochastic bandit problem.

The algorithm for Bernoulli bandits maintains Bayesian priors on the Bernoulli means $\mu_i$. The Beta distribution turns out to be a very convenient choice of priors for Bernoulli rewards. Let us briefly recall that beta distributions form a family of continuous probability distributions on the interval (0, 1). The pdf of Beta($\alpha$, $\beta$), the beta distribution with parameters $\alpha > 0$, $\beta > 0$, is given by $f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1}(1-x)^{\beta-1}$. The mean of Beta($\alpha$, $\beta$) is $\alpha/(\alpha+\beta)$; and as is apparent from the pdf, the higher the $\alpha$, $\beta$, the tighter the concentration of Beta($\alpha$, $\beta$) around the mean. The Beta distribution is useful for Bernoulli rewards because if the prior is a Beta($\alpha$, $\beta$) distribution, then after observing a Bernoulli trial, the posterior distribution is simply Beta($\alpha + 1$, $\beta$) or Beta($\alpha$, $\beta + 1$), depending on whether the trial resulted in a success or failure, respectively.
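
The conjugate update just described can be sketched in a few lines. The helper names below (`beta_posterior`, `beta_mean`, `beta_variance`) are illustrative choices, not notation from the paper; the variance formula is the standard one for the Beta distribution and is included only to show the concentration effect.

```python
def beta_posterior(alpha, beta, reward):
    """Posterior parameters after one Bernoulli trial under a Beta(alpha, beta) prior."""
    if reward == 1:
        return alpha + 1, beta      # success
    return alpha, beta + 1          # failure


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    total = alpha + beta
    return alpha * beta / (total * total * (total + 1))


# Start from the uniform prior Beta(1, 1) and observe success, success, failure.
alpha, beta = 1, 1
for reward in [1, 1, 0]:
    alpha, beta = beta_posterior(alpha, beta, reward)
```

After these three observations the posterior is Beta(3, 2), and the variance shrinks as the parameters grow, which is the concentration property the analysis relies on.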

The Thompson Sampling algorithm initially assumes arm i to have prior Beta(1, 1) on $\mu_i$, which is natural because Beta(1, 1) is the uniform distribution on (0, 1). At time t, having observed $S_i(t)$ successes (reward = 1) and $F_i(t)$ failures (reward = 0) in $k_i(t) = S_i(t) + F_i(t)$ plays of arm i, the algorithm updates the distribution on $\mu_i$ as Beta($S_i(t) + 1$, $F_i(t) + 1$). The algorithm then samples from these posterior distributions of the $\mu_i$'s, and plays an arm according to the probability of its mean being the largest. We summarize the Thompson Sampling algorithm below.

Algorithm 1: Thompson Sampling for Bernoulli bandits
For each arm i = 1, . . . , N set $S_i = 0$, $F_i = 0$.
foreach t = 1, 2, . . . , do
    For each arm i = 1, . . . , N, sample $\theta_i(t)$ from the Beta($S_i + 1$, $F_i + 1$) distribution.
    Play arm $i(t) := \arg\max_i \theta_i(t)$ and observe reward $r_t$.
    If $r_t = 1$, then $S_{i(t)} = S_{i(t)} + 1$, else $F_{i(t)} = F_{i(t)} + 1$.
end
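
Algorithm 1 can be simulated with the Python standard library alone. The sketch below is one possible rendering under stated assumptions: the function name `thompson_sampling`, the finite `horizon`, the `seed` argument, and the simulated Bernoulli environment driven by `means` are all illustrative additions, since the algorithm itself never sees the true means.

```python
import random


def thompson_sampling(means, horizon, seed=0):
    """Simulate Algorithm 1 on Bernoulli arms with the given (hidden) means.

    Returns per-arm play counts, successes S_i and failures F_i.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    plays = [0] * n_arms
    for _ in range(horizon):
        # Sample theta_i(t) from Beta(S_i + 1, F_i + 1) for every arm.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        # Play the arm with the largest sample.
        arm = max(range(n_arms), key=samples.__getitem__)
        # The environment draws a Bernoulli reward; this part is simulation only.
        reward = 1 if rng.random() < means[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
        plays[arm] += 1
    return plays, successes, failures


plays, successes, failures = thompson_sampling([0.9, 0.1], horizon=2000, seed=0)
```

With a large gap between the two arms, almost all plays should go to the first arm after a short exploration phase.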



1.3 Our results 

In this article, we bound the finite time expected regret of Thompson Sampling. From now on we will assume that the first arm is the unique optimal arm, i.e., $\mu^* = \mu_1 > \max_{i \neq 1} \mu_i$. Assuming that the first arm is an optimal arm is a matter of convenience for stating the results and for the analysis; of course, the algorithm does not use this assumption. The assumption of a unique optimal arm is also without loss of generality, since adding more arms with $\mu_i = \mu^*$ can only decrease the expected regret; details of this argument were provided in [1].

Theorem 1. (Problem-dependent bound) For the N-armed stochastic bandit problem, the Thompson Sampling algorithm has expected regret

$$E[\mathcal{R}(T)] \le (1+\epsilon) \sum_{i=2}^{N} \frac{\ln T}{d(\mu_i, \mu_1)} \Delta_i + O\left(\frac{N}{\epsilon^2}\right)$$

in time T, where $d(\mu_i, \mu_1) = \mu_i \log\frac{\mu_i}{\mu_1} + (1-\mu_i) \log\frac{1-\mu_i}{1-\mu_1}$. The big-Oh notation (see footnote 1) in the above assumes $\mu_i, \Delta_i$, i = 1, . . . , N to be constants.

Theorem 2. (Problem-independent bound) For the N-armed stochastic bandit problem, the Thompson Sampling algorithm has expected regret

$$E[\mathcal{R}(T)] \le O(\sqrt{NT \ln T})$$

in time T, where the big-Oh notation hides only the absolute constants.

1 For any two functions f(n), g(n), $f(n) = O(g(n))$ if there exist two constants $n_0$ and c such that for all $n \ge n_0$, $f(n) \le c\, g(n)$.



Let us contrast our bounds with the previous work. Let us first consider the problem-dependent regret bounds, i.e., regret bounds that depend on problem parameters $\mu_i, \Delta_i$, i = 1, . . . , N. Lai and Robbins [15] essentially proved the following lower bound on the regret of any bandit algorithm (see [15] for a precise statement):

$$E[\mathcal{R}(T)] \ge \left[\sum_{i=2}^{N} \frac{\Delta_i}{d(\mu_i, \mu_1)} + o(1)\right] \ln T.$$

They also gave algorithms asymptotically achieving this guarantee, though unfortunately their algorithms are not efficient. Auer et al. [4] gave the UCB1 algorithm, which is efficient and achieves the following bound:

$$E[\mathcal{R}(T)] \le \left[8 \sum_{i=2}^{N} \frac{1}{\Delta_i}\right] \ln T + (1 + \pi^2/3) \left(\sum_{i=2}^{N} \Delta_i\right).$$

More recently, Kaufmann et al. [13] gave the Bayes-UCB algorithm, which achieves the lower bound of [15] for Bernoulli rewards. Bayes-UCB is a UCB-like algorithm, where the upper confidence bounds are based on the quantiles of Beta posterior distributions. Interestingly, these upper confidence bounds turn out to be similar to those used by algorithms in [5] and [16]. Our bounds in Theorem 1 achieve the asymptotic lower bounds of [15], and match those provided by [14] for Thompson Sampling.

Theorem 2 shows that Thompson Sampling also achieves a problem-independent regret bound of $O(\sqrt{NT \ln T})$. This is the first analysis for TS that matches the $\Omega(\sqrt{NT})$ problem-independent lower bound (see [5]) for this problem within logarithmic factors. The problem-dependent bounds in the existing work implied only suboptimal problem-independent bounds: [1] implied a problem-independent bound of $O(T^{2/3})$. In [14], the additive problem-dependent term was not explicitly calculated, which makes it difficult to derive the corresponding problem-independent bound, but on a preliminary examination, it appears that it would involve an even higher power of T. To compare with other existing algorithms for this problem, note that the best known problem-independent bound for the expected regret of UCB1 is also $O(\sqrt{NT \ln T})$ (see [5]). More recently, Audibert and Bubeck [3] gave an algorithm MOSS, inspired by UCB1, with regret $O(\sqrt{NT})$.



2 Proofs 

In this section, we prove Theorem 1 and Theorem 2. The proofs of the two theorems follow the same steps, and diverge only towards the end of the analysis.

Proof Outline: Our proof uses a martingale-based analysis. Essentially, we prove that conditioned on any history of execution in the preceding steps, the probability of playing any suboptimal arm i at the current step can be bounded by a linear function of the probability of playing the optimal arm at the current step. This is proven in Lemma 1, which forms the core of our analysis. Further, we show that the coefficient in this linear function decreases exponentially fast with the increase in the number of plays of the optimal arm (refer to Lemma 4). This allows us to bound the total number of plays of every suboptimal arm, and hence to bound the regret as desired. The difference between the analysis for obtaining the logarithmic problem-dependent bound of Theorem 1 and the problem-independent bound of Theorem 2 is merely technical, and occurs only towards the end of the proof.

We recall some of the definitions introduced earlier, and introduce some new notation used in the proof. $F^B_{n,p}(\cdot)$ denotes the cdf and $f^B_{n,p}(\cdot)$ denotes the probability mass function of the binomial distribution with parameters n, p. Let $F^{beta}_{\alpha,\beta}(\cdot)$ denote the cdf of the beta distribution with parameters $\alpha, \beta$.

Definition 1. $k_i(t)$ is defined as the number of plays of arm i until time t - 1, and $S_i(t)$ as the number of successes among the plays of arm i until time t - 1. Also, i(t) denotes the arm played at time t.

Definition 2. For each arm i, we will choose two thresholds $x_i$ and $y_i$ such that $\mu_i < x_i < y_i < \mu_1$. The specific choice of these thresholds will depend on whether we are proving the problem-dependent bound or the problem-independent bound, and will be described at the appropriate points in the proof. Define $L_i(T) = \frac{\ln T}{d(x_i, y_i)}$, and $\hat{\mu}_i(t) = S_i(t)/k_i(t)$ (define $\hat{\mu}_i(t) = 1$ when $k_i(t) = 0$). Define $E_i^\mu(t)$ as the event that $\hat{\mu}_i(t) \le x_i$. Define $E_i^\theta(t)$ as the event that $\theta_i(t) \le y_i$.

Intuitively, $E_i^\mu(t)$, $E_i^\theta(t)$ are the events that $\hat{\mu}_i(t)$ and $\theta_i(t)$, respectively, are not too far from the mean $\mu_i$. As we show later, these events will hold with high probability for most time steps.

Definition 3. Define the filtration $\mathcal{F}_{t-1}$ as the history of plays until time t - 1, i.e.

$$\mathcal{F}_{t-1} = \{i(w), r_{i(w)}(w),\ i = 1, \ldots, N,\ w = 1, \ldots, t-1\},$$

where i(t) denotes the arm played at time t, and $r_i(t)$ denotes the reward observed for arm i at time t.

Definition 4. Define $p_{i,t}$ as the probability

$$p_{i,t} = \Pr(\theta_1(t) > y_i \mid \mathcal{F}_{t-1}).$$

Note that $p_{i,t}$ is determined by $\mathcal{F}_{t-1}$.
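Lemma 1. For all $t \in [1, T]$ and $i \ne 1$,

$$\Pr\left(i(t) = i, E_i^\mu(t), E_i^\theta(t) \mid \mathcal{F}_{t-1}\right) \le \frac{1 - p_{i,t}}{p_{i,t}} \Pr\left(i(t) = 1, E_i^\mu(t), E_i^\theta(t) \mid \mathcal{F}_{t-1}\right),$$

where $p_{i,t} = \Pr(\theta_1(t) > y_i \mid \mathcal{F}_{t-1})$.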

Proof. Note that whether $E_i^\mu(t)$ is true or not is determined by $\mathcal{F}_{t-1}$. Assume that the filtration $\mathcal{F}_{t-1}$ is such that $E_i^\mu(t)$ is true (otherwise the probability on the left hand side is 0 and the inequality is trivially true). It then suffices to prove that

$$\Pr(i(t) = i \mid E_i^\theta(t), \mathcal{F}_{t-1}) \le \frac{1 - p_{i,t}}{p_{i,t}} \Pr(i(t) = 1 \mid E_i^\theta(t), \mathcal{F}_{t-1}). \qquad (1)$$

Let $M_i(t)$ denote the event that arm i exceeds all the suboptimal arms at time t. That is,

$$M_i(t) : \theta_i(t) \ge \theta_j(t),\ \forall j \ne 1.$$

We will prove the following two inequalities, which immediately give (1).

$$\Pr(i(t) = 1 \mid E_i^\theta(t), \mathcal{F}_{t-1}) \ge p_{i,t} \cdot \Pr(M_i(t) \mid E_i^\theta(t), \mathcal{F}_{t-1}), \qquad (2)$$
$$\Pr(i(t) = i \mid E_i^\theta(t), \mathcal{F}_{t-1}) \le (1 - p_{i,t}) \cdot \Pr(M_i(t) \mid E_i^\theta(t), \mathcal{F}_{t-1}). \qquad (3)$$

We have

$$\Pr(i(t) = 1 \mid E_i^\theta(t), \mathcal{F}_{t-1}) \ge \Pr(i(t) = 1, M_i(t) \mid E_i^\theta(t), \mathcal{F}_{t-1}) = \Pr(M_i(t) \mid E_i^\theta(t), \mathcal{F}_{t-1}) \cdot \Pr(i(t) = 1 \mid M_i(t), E_i^\theta(t), \mathcal{F}_{t-1}). \qquad (4)$$

Now, given $M_i(t), E_i^\theta(t)$, it holds that for all $j \ne i$, $j \ne 1$,

$$\theta_j(t) \le \theta_i(t) \le y_i,$$

and so

$$\Pr(i(t) = 1 \mid M_i(t), E_i^\theta(t), \mathcal{F}_{t-1}) \ge \Pr(\theta_1(t) > y_i \mid M_i(t), E_i^\theta(t), \mathcal{F}_{t-1}) = \Pr(\theta_1(t) > y_i \mid \mathcal{F}_{t-1}) = p_{i,t}.$$

The second last equality follows because the events $M_i(t)$ and $E_i^\theta(t)$, $\forall i \ne 1$, involve conditions on only $\theta_j(t)$ for $j \ne 1$, and given $\mathcal{F}_{t-1}$ (and hence $\hat{\mu}_j(t), k_j(t), \forall j$), $\theta_1(t)$ is independent of all the other $\theta_j(t)$, $j \ne 1$, and hence independent of these events. This together with (4) gives (2).
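Since $E_i^\theta(t)$ is the event that $\theta_i(t) \le y_i$, therefore, given $E_i^\theta(t)$, $i(t) = i$ only if $\theta_1(t) \le y_i$. This gives (3):

$$\begin{aligned}
\Pr(i(t) = i \mid E_i^\theta(t), \mathcal{F}_{t-1}) &\le \Pr(\theta_1(t) \le y_i,\ \theta_i(t) \ge \theta_j(t), \forall j \ne 1 \mid E_i^\theta(t), \mathcal{F}_{t-1}) \\
&= \Pr(\theta_1(t) \le y_i \mid \mathcal{F}_{t-1}) \cdot \Pr(\theta_i(t) \ge \theta_j(t), \forall j \ne 1 \mid E_i^\theta(t), \mathcal{F}_{t-1}) \\
&= (1 - p_{i,t}) \cdot \Pr(M_i(t) \mid E_i^\theta(t), \mathcal{F}_{t-1}).
\end{aligned}$$

□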



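Lemma 2.

$$\sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\mu(t)}\right) \le \frac{1}{d(x_i, \mu_i)} + 1.$$

Proof. This essentially follows from an application of Chernoff-Hoeffding bounds for the concentration of $\hat{\mu}_i(t)$. Refer to Appendix B for details. □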

Lemma 3.

$$\sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\theta(t)}, E_i^\mu(t)\right) \le L_i(T) + 1.$$

Proof. This essentially follows from the observation that the beta-distributed random variable $\theta_i(t)$ is well-concentrated around its mean when $k_i(t)$ is large, that is, larger than $L_i(T)$. Refer to Appendix C for details. □

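Lemma 4. Let $\tau_j$ denote the time step at which the j-th trial of the first arm happens, then

$$E\left[\frac{1}{p_{i,\tau_j+1}}\right] \le
\begin{cases}
1 + \frac{3}{\Delta_i'}, & j < \frac{8}{\Delta_i'}, \\
1 + \Theta\left(e^{-\Delta_i'^2 j/2} + \frac{1}{(j+1)\Delta_i'^2} e^{-D_i j} + \frac{1}{e^{\Delta_i'^2 j/4} - 1}\right), & j \ge \frac{8}{\Delta_i'},
\end{cases}$$

where $\Delta_i' = \mu_1 - y_i$, $D_i = y_i \log\frac{y_i}{\mu_1} + (1 - y_i) \log\frac{1 - y_i}{1 - \mu_1}$.

Proof. The proof of this inequality involves some careful algebraic manipulations using tight estimates for partial Binomial sums provided by [12]. Refer to Appendix D for details. □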

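Proof of Theorem 1 and Theorem 2. Let $\tau_k$ denote the time step at which arm 1 is played for the k-th time for $k \ge 1$, and let $\tau_0 = 0$. Using the above lemmas,

$$\begin{aligned}
E[k_i(T)] &= \sum_{t=1}^{T} \Pr(i(t) = i) \\
&= \sum_{t=1}^{T} \Pr\left(i(t) = i, E_i^\mu(t), E_i^\theta(t)\right) + \sum_{t=1}^{T} \Pr\left(i(t) = i, E_i^\mu(t), \overline{E_i^\theta(t)}\right) + \sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\mu(t)}\right) \\
&\le \sum_{t=1}^{T} E\left[\frac{1 - p_{i,t}}{p_{i,t}}\, I\left(i(t) = 1, E_i^\theta(t), E_i^\mu(t)\right)\right] + L_i(T) + 1 + \frac{1}{d(x_i, \mu_i)} + 1 \\
&\overset{(*)}{\le} \sum_{k=0}^{T-1} E\left[\frac{1 - p_{i,\tau_k+1}}{p_{i,\tau_k+1}} \sum_{t=\tau_k+1}^{\tau_{k+1}} I(i(t) = 1)\right] + L_i(T) + 1 + \frac{1}{d(x_i, \mu_i)} + 1 \\
&= \sum_{k=0}^{T-1} E\left[\frac{1}{p_{i,\tau_k+1}} - 1\right] + L_i(T) + 1 + \frac{1}{d(x_i, \mu_i)} + 1 \\
&\le \frac{24}{\Delta_i'^2} + \sum_{j \ge 8/\Delta_i'}^{T-1} \Theta\left(e^{-\Delta_i'^2 j/2} + \frac{1}{(j+1)\Delta_i'^2} e^{-D_i j} + \frac{1}{e^{\Delta_i'^2 j/4} - 1}\right) + L_i(T) + 1 + \frac{1}{d(x_i, \mu_i)} + 1. \qquad (5)
\end{aligned}$$

The first inequality applies Lemma 1 to the first sum, Lemma 3 to the second and Lemma 2 to the third; the last inequality applies Lemma 4. The inequality marked (*) uses the observation that $p_{i,t} = \Pr(\theta_1(t) > y_i \mid \mathcal{F}_{t-1})$ changes only when the distribution of $\theta_1(t)$ changes, that is, only on the time step after each play of the first arm. Thus, $p_{i,t}$ is the same at all time steps $t \in \{\tau_k + 1, \ldots, \tau_{k+1}\}$, for every k.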



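For the logarithmic problem-dependent bound of Theorem 1, for some $0 < \epsilon \le 1$, we set $x_i \in (\mu_i, \mu_1)$ such that $d(x_i, \mu_1) = d(\mu_i, \mu_1)/(1+\epsilon)$, and set $y_i \in (x_i, \mu_1)$ such that $d(x_i, y_i) = d(x_i, \mu_1)/(1+\epsilon) = d(\mu_i, \mu_1)/(1+\epsilon)^2$ (see footnote 2). This gives

$$L_i(T) = \frac{\ln T}{d(x_i, y_i)} = (1+\epsilon)^2 \frac{\ln T}{d(\mu_i, \mu_1)}.$$

Also, by some simple algebraic manipulations of the equality $d(x_i, \mu_1) = d(\mu_i, \mu_1)/(1+\epsilon)$, we can obtain

$$x_i - \mu_i \ge \frac{\epsilon}{1+\epsilon} \cdot \frac{d(\mu_i, \mu_1)}{\ln\left(\frac{\mu_1(1-\mu_i)}{\mu_i(1-\mu_1)}\right)},$$

giving

$$\frac{1}{d(x_i, \mu_i)} \le \frac{2}{(x_i - \mu_i)^2} = O\left(\frac{1}{\epsilon^2}\right).$$

Here the order notation is hiding functions of the $\mu_i$'s and $\Delta_i$'s, since they are assumed to be constants. Similarly,

$$\sum_{j \ge 8/\Delta_i'} \Theta\left(e^{-\Delta_i'^2 j/2} + \frac{1}{(j+1)\Delta_i'^2} e^{-D_i j} + \frac{1}{e^{\Delta_i'^2 j/4} - 1}\right) = \Theta\left(\frac{1}{\Delta_i'^2} + \frac{1}{\Delta_i'^2 D_i} + \frac{1}{\Delta_i'^4} + \frac{1}{\Delta_i'^2}\right) = \Theta(1).$$

Combining, we get

$$E[\mathcal{R}(T)] = \sum_i \Delta_i E[k_i(T)] \le \sum_i (1+\epsilon)^2 \frac{\ln T}{d(\mu_i, \mu_1)} \Delta_i + O\left(\frac{N}{\epsilon^2}\right) \le \sum_i (1+\epsilon') \frac{\ln T}{d(\mu_i, \mu_1)} \Delta_i + O\left(\frac{N}{\epsilon'^2}\right),$$

where $\epsilon' = 3\epsilon$, and the order notation in the above hides $\mu_i$'s and $\Delta_i$'s in addition to the absolute constants.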



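For obtaining the $O(\sqrt{NT \ln T})$ problem-independent bound of Theorem 2, we pick $x_i = \mu_i + \frac{\Delta_i}{3}$, $y_i = \mu_1 - \frac{\Delta_i}{3}$, so that $\Delta_i'^2 = (\mu_1 - y_i)^2 = \frac{\Delta_i^2}{9}$, and using Pinsker's inequality, $d(x_i, \mu_i) \ge \frac{1}{2}(x_i - \mu_i)^2 = \frac{\Delta_i^2}{18}$, $d(x_i, y_i) \ge \frac{1}{2}(y_i - x_i)^2 \ge \frac{\Delta_i^2}{18}$. Then,

$$L_i(T) = \frac{\ln T}{d(x_i, y_i)} \le \frac{18 \ln T}{\Delta_i^2}, \qquad \frac{1}{d(x_i, \mu_i)} \le \frac{18}{\Delta_i^2},$$

$$\sum_{j \ge 8/\Delta_i'}^{T-1} \Theta\left(e^{-\Delta_i'^2 j/2} + \frac{1}{(j+1)\Delta_i'^2} e^{-D_i j} + \frac{1}{e^{\Delta_i'^2 j/4} - 1}\right) \le \sum_{j \ge 8/\Delta_i'}^{T-1} \Theta\left(e^{-\Delta_i'^2 j/2} + \frac{1}{(j+1)\Delta_i'^2} + \frac{4}{\Delta_i'^2 j}\right) = \Theta\left(\frac{1}{\Delta_i^2} + \frac{\ln T}{\Delta_i^2}\right) = \Theta\left(\frac{\ln T}{\Delta_i^2}\right).$$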
2 This way of choosing thresholds, in order to obtain bounds in terms of KL-divergences $d(\mu_i, \mu_1)$ rather than $\Delta_i$'s, is inspired
by HEDIS].
This gives,

$$E[\mathcal{R}(T)] = \sum_i \Delta_i E[k_i(T)] = O\left(\sum_i \frac{\ln T}{\Delta_i}\right).$$

We observe that in the worst case, for all suboptimal i, $\Delta_i \ge \sqrt{\frac{N \ln T}{T}}$. This is because the total regret on playing arms with $\Delta_i < \sqrt{\frac{N \ln T}{T}}$ instead of the optimal arm is at most $\sqrt{NT \ln T}$. Thus, all the arms with $\Delta_i < \sqrt{\frac{N \ln T}{T}}$ can be assumed to be optimal arms. Also, in [1] we proved that multiple optimal arms can only help.

Therefore, substituting $\Delta_i = \sqrt{\frac{N \ln T}{T}}$,

$$E[\mathcal{R}(T)] = O(\sqrt{NT \ln T}).$$

□
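
The final substitution can be checked numerically. The sketch below (function names are illustrative, and constants hidden by the order notation are ignored) evaluates the threshold gap $\sqrt{N \ln T / T}$, the regret of at most $T \cdot \Delta$ incurred on small-gap arms, and the sum $\sum_i \ln T / \Delta_i$ over N arms at that gap, and compares both with $\sqrt{NT \ln T}$.

```python
import math


def worst_case_gap(n_arms, horizon):
    """Threshold gap sqrt(N ln T / T) used in the problem-independent argument."""
    return math.sqrt(n_arms * math.log(horizon) / horizon)


def regret_terms(n_arms, horizon):
    gap = worst_case_gap(n_arms, horizon)
    small_gap_regret = horizon * gap                      # T plays, each losing at most `gap`
    large_gap_regret = n_arms * math.log(horizon) / gap   # sum_i ln T / Delta_i at Delta_i = gap
    target = math.sqrt(n_arms * horizon * math.log(horizon))
    return small_gap_regret, large_gap_regret, target


small, large, target = regret_terms(n_arms=10, horizon=100_000)
```

Both contributions coincide with $\sqrt{NT \ln T}$ at this choice of gap, which is why the threshold balances the two regimes.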
A Some results used in the proofs

Fact 1 (Chernoff-Hoeffding bound). Let $X_1, \ldots, X_n$ be independent 0-1 r.v.s with $E[X_i] = p_i$ (not necessarily equal). Let $X = \frac{1}{n}\sum_i X_i$, $\mu = E[X] = \frac{1}{n}\sum_{i=1}^{n} p_i$. Then, for any $0 < \lambda < 1 - \mu$,

$$\Pr(X \ge \mu + \lambda) \le \exp\{-n\, d(\mu + \lambda, \mu)\},$$

and, for any $0 < \lambda < \mu$,

$$\Pr(X \le \mu - \lambda) \le \exp\{-n\, d(\mu - \lambda, \mu)\},$$

where $d(a, b) = a \ln\frac{a}{b} + (1 - a) \ln\frac{1-a}{1-b}$.
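
To make Fact 1 concrete, the following sketch compares the bound $\exp\{-n\, d(\mu+\lambda, \mu)\}$ with the exact upper tail of a binomial distribution (the special case of equal $p_i$). The helper names `kl_bernoulli` and `binomial_upper_tail` and the values of `n`, `mu`, `lam` are illustrative assumptions.

```python
import math


def kl_bernoulli(a, b):
    """d(a, b) = a ln(a/b) + (1 - a) ln((1 - a)/(1 - b)), with the convention 0 ln 0 = 0."""
    total = 0.0
    if a > 0:
        total += a * math.log(a / b)
    if a < 1:
        total += (1 - a) * math.log((1 - a) / (1 - b))
    return total


def binomial_upper_tail(n, p, k):
    """Exact Pr(Binomial(n, p) >= k)."""
    return sum(math.comb(n, s) * p**s * (1 - p)**(n - s) for s in range(k, n + 1))


n, mu, lam = 50, 0.3, 0.2
# Pr(X >= mu + lam) for the empirical mean X of n Bernoulli(mu) draws.
exact = binomial_upper_tail(n, mu, math.ceil(n * (mu + lam)))
bound = math.exp(-n * kl_bernoulli(mu + lam, mu))
```

The exact tail sits below the bound, as the fact requires; the gap between the two reflects the polynomial factor that the exponential bound ignores.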

Fact 2 (Chernoff-Hoeffding bound). Let $X_1, \ldots, X_n$ be random variables with common range [0, 1] and such that $E[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_n = X_1 + \ldots + X_n$. Then for all $a \ge 0$,

$$\Pr(S_n \ge n\mu + a) \le e^{-2a^2/n},$$

$$\Pr(S_n \le n\mu - a) \le e^{-2a^2/n}.$$

Fact 3.

$$F^{beta}_{\alpha,\beta}(y) = 1 - F^B_{\alpha+\beta-1,\,y}(\alpha - 1),$$

for all positive integers $\alpha, \beta$.
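
Fact 3 can be sanity-checked numerically for small integer parameters. The sketch below integrates the Beta pdf with a midpoint rule and compares the result with the binomial expression; the function names, the step count, and the test values $\alpha = 4$, $\beta = 7$, $y = 0.35$ are illustrative assumptions.

```python
import math


def binomial_cdf(n, p, s):
    """F^B_{n,p}(s) = Pr(Binomial(n, p) <= s)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(0, s + 1))


def beta_cdf_numeric(alpha, beta, y, steps=20000):
    """Midpoint-rule approximation of the Beta(alpha, beta) cdf at y."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    h = y / steps
    total = 0.0
    for m in range(steps):
        x = (m + 0.5) * h
        total += x**(alpha - 1) * (1 - x)**(beta - 1)
    return norm * total * h


alpha, beta, y = 4, 7, 0.35
lhs = beta_cdf_numeric(alpha, beta, y)
rhs = 1 - binomial_cdf(alpha + beta - 1, y, alpha - 1)
```

The two sides agree up to the integration error, which is what allows the proofs to replace tail probabilities of the Beta posterior by binomial cdf values.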
B Proof of Lemma 2

Let $\tau_k$ denote the time at which the k-th trial of arm i happens. Let $\tau_0 = 0$. Then,

$$\begin{aligned}
\sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\mu(t)}\right) &\le E\left[\sum_{k=0}^{T-1} \sum_{t=\tau_k+1}^{\tau_{k+1}} I(i(t) = i)\, I\left(\overline{E_i^\mu(t)}\right)\right] \\
&= E\left[\sum_{k=0}^{T-1} I\left(\overline{E_i^\mu(\tau_k + 1)}\right) \sum_{t=\tau_k+1}^{\tau_{k+1}} I(i(t) = i)\right] \\
&\le E\left[\sum_{k=0}^{T-1} I\left(\overline{E_i^\mu(\tau_k + 1)}\right)\right] \\
&\le 1 + E\left[\sum_{k=1}^{T-1} I\left(\overline{E_i^\mu(\tau_k + 1)}\right)\right] \\
&\le 1 + \sum_{k=1}^{T-1} \exp(-k\, d(x_i, \mu_i)) \\
&\le 1 + \frac{1}{d(x_i, \mu_i)}.
\end{aligned}$$

The second last inequality follows from the observation that the event $\overline{E_i^\mu(t)}$ was defined as $\hat{\mu}_i(t) > x_i$, where $\hat{\mu}_i(t)$ is the average of the outcomes observed from the plays of arm i until time t - 1. Thus at time $\tau_k + 1$, $\hat{\mu}_i(\tau_k + 1)$ is simply the average of the outcomes observed from k i.i.d. plays of arm i, each of which is a Bernoulli trial with mean $\mu_i$. Using Chernoff-Hoeffding bounds (Fact 1), we obtain that $\Pr(\hat{\mu}_i(\tau_k + 1) > x_i) \le e^{-k\, d(x_i, \mu_i)}$. □
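
The last step of the chain rests on a geometric-series estimate that is not spelled out. Writing $d = d(x_i, \mu_i) > 0$ as shorthand (introduced here only for this remark), one way to fill it in is:

```latex
\sum_{k=1}^{T-1} e^{-kd}
  \;\le\; \sum_{k=1}^{\infty} e^{-kd}
  \;=\; \frac{e^{-d}}{1 - e^{-d}}
  \;=\; \frac{1}{e^{d} - 1}
  \;\le\; \frac{1}{d},
\qquad \text{since } e^{d} \ge 1 + d .
```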

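C Proof of Lemma 3

$$\begin{aligned}
\Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t), \mathcal{F}_{t-1}\right) &\le \Pr\left(\theta_i(t) > y_i \mid \hat{\mu}_i(t) \le x_i, \mathcal{F}_{t-1}\right) \\
&= \Pr\left(\mathrm{Beta}\left(\hat{\mu}_i(t) k_i(t) + 1, (1 - \hat{\mu}_i(t)) k_i(t) + 1\right) > y_i \mid \hat{\mu}_i(t) \le x_i\right) \\
&\le \Pr\left(\mathrm{Beta}\left(x_i k_i(t) + 1, (1 - x_i) k_i(t) + 1\right) > y_i\right) \\
&= F^B_{k_i(t)+1,\, y_i}(x_i k_i(t)) \qquad \text{(Fact 3)} \\
&\le F^B_{k_i(t),\, y_i}(x_i k_i(t)) \\
&\le e^{-k_i(t)\, d(x_i, y_i)},
\end{aligned}$$

where the last inequality follows from Chernoff-Hoeffding bounds (refer to Fact 1). Therefore, for t such that $k_i(t) > L_i(T)$,

$$\Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t), \mathcal{F}_{t-1}\right) \le \frac{1}{T}.$$

Let $\tau$ be the largest time step until $k_i(t) \le L_i(T)$, then,

$$\begin{aligned}
\sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\theta(t)}, E_i^\mu(t)\right) &\le \sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t)\right) \\
&= E\left[\sum_{t=1}^{T} \Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t), \mathcal{F}_{t-1}\right)\right] \\
&= E\left[\sum_{t=1}^{\tau} \Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t), \mathcal{F}_{t-1}\right) + \sum_{t=\tau+1}^{T} \Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t), \mathcal{F}_{t-1}\right)\right] \\
&\le E\left[\sum_{t=1}^{\tau} \Pr\left(i(t) = i, \overline{E_i^\theta(t)} \mid E_i^\mu(t), \mathcal{F}_{t-1}\right)\right] + E\left[\sum_{t=\tau+1}^{T} \frac{1}{T}\right] \\
&\le E\left[\sum_{t=1}^{\tau} I(i(t) = i)\right] + 1 \\
&\le L_i(T) + 1.
\end{aligned}$$

□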



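D Proof of Lemma 4

Let $k_1(t) = j$, $S_1(t) = s$. Let $y = y_i$. Then, $p_{i,t} = \Pr(\theta_1(t) > y) = F^B_{j+1,y}(s)$. Let $\tau_j + 1$ denote the time step after the j-th play of arm 1. Then, $k_1(\tau_j + 1) = j$, and

$$E\left[\frac{1}{p_{i,\tau_j+1}}\right] = \sum_{s=0}^{j} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)}.$$

Let $\Delta' = \mu_1 - y$.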
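
The quantity bounded in Lemma 4 can be evaluated exactly for small j, using the expression of $p_{i,t}$ as a binomial cdf given above together with the fact that the number of successes in j plays of arm 1 is Binomial(j, $\mu_1$). The sketch below uses hypothetical helper names and example values ($\mu_1 = 0.8$, $y = 0.5$); it is meant only as a numerical illustration, not as part of the proof.

```python
import math


def binom_pmf(n, p, s):
    """f_{n,p}(s): probability of exactly s successes in n trials."""
    return math.comb(n, s) * p**s * (1 - p)**(n - s)


def binom_cdf(n, p, s):
    """F_{n,p}(s): probability of at most s successes in n trials."""
    return sum(binom_pmf(n, p, k) for k in range(s + 1))


def expected_inverse_p(j, mu1, y):
    """sum_{s=0}^{j} f_{j,mu1}(s) / F_{j+1,y}(s), i.e. E[1 / p_{i,tau_j+1}]."""
    return sum(binom_pmf(j, mu1, s) / binom_cdf(j + 1, y, s) for s in range(j + 1))


values = [expected_inverse_p(j, mu1=0.8, y=0.5) for j in (0, 5, 50, 200)]
```

With no plays of arm 1, $\theta_1$ is uniform and the value is $1/(1-y)$; as j grows the value approaches 1, which is the behaviour Lemma 4 quantifies.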

For $j < \frac{8}{\Delta'}$: Let $R = \frac{\mu_1(1-y)}{y(1-\mu_1)}$, $D = y \log\frac{y}{\mu_1} + (1 - y) \log\frac{1-y}{1-\mu_1}$.



■ f j+i.w(«) 



E 



s=0 



< 



< 



— E 

1 - y ^ 

i ^4^(fl + i £ 2/ (s) 



< 



< 



(l-2/> 



A' + A' 
3 

A 7 ' 



(1 - y)j A' 



(6) 
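For $j \ge \frac{8}{\Delta'}$: We will divide the sum $Sum(0, j) = \sum_{s=0}^{j} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)}$ into four partial sums and prove that

$$\begin{aligned}
Sum(0, \lfloor yj \rfloor - 1) &\le \Theta\left(e^{-Dj} \frac{1}{(j+1)} \frac{1}{\Delta'^2}\right) + \Theta\left(e^{-2\Delta'^2 j}\right), \\
Sum(\lfloor yj \rfloor, \lfloor yj \rfloor) &\le 3 e^{-Dj}, \\
Sum\left(\lceil yj \rceil, \left\lfloor \mu_1 j - \tfrac{\Delta'}{2} j \right\rfloor\right) &\le \Theta\left(e^{-\Delta'^2 j/2}\right), \\
Sum\left(\left\lceil \mu_1 j - \tfrac{\Delta'}{2} j \right\rceil, j\right) &\le 1 + \frac{1}{e^{\Delta'^2 j/4} - 1}.
\end{aligned}$$

Together, the above estimates will prove the required bound.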
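We use the following bounds on the cdf of the Binomial distribution [12, Prop. A.4].
For $s \le y(j+1) - \sqrt{(j+1)y(1-y)}$,

$$F_{j+1,y}(s) = \Theta\left(\frac{y(j+1-s)}{y(j+1) - s} \binom{j+1}{s} y^s (1-y)^{j+1-s}\right).$$

For $s \ge y(j+1) - \sqrt{(j+1)y(1-y)}$,

$$F_{j+1,y}(s) = \Theta(1).$$

Bounding $Sum(0, \lfloor yj \rfloor - 1)$. Using the bounds just given, for any s,

$$\frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)} \le \Theta\left(\left(1 - \frac{s}{y(j+1)}\right) R^s \frac{(1-\mu_1)^j}{(1-y)^{j+1}}\right) + \Theta(1)\, f_{j,\mu_1}(s).$$

This gives

$$Sum(0, \lfloor yj \rfloor - 1) \le \Theta\left(\frac{(1-\mu_1)^j}{(1-y)^{j+1}} \sum_{s=0}^{\lfloor yj \rfloor - 1} \left(1 - \frac{s}{y(j+1)}\right) R^s\right) + \Theta(1) \sum_{s=0}^{\lfloor yj \rfloor - 1} f_{j,\mu_1}(s). \qquad (7)$$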



We now bound the first expression on the RHS. 

V + (l-yp+^fi-i 2/(i + i)V fi-i (^-i) 2 , 

< (i-mi) j f 1 R lyn , (wCi + 1) - LwiJ + 1) ^ LmJ 

< 



V»0' + 1) 2 y(i + i) (-H-i). 

(1-Mi) j 3 JjLw'J+i 

(1 - vY +1 yti + i)(R- i) 2 

3 R 



y{l-y)(] + l) (R-1)' 



The last inequality uses 



(! - ^Lvij . ^-Miy' jyj = & -Dj 



And, = iiiii=sl. r 
1 -R 1 /zi(l-y) 2/(1 — A*x) 1 /ii(l-/ii) 



Now, J? - 1 = ^#^4 - 1 = And, #y = Therefore 



y(l-y)(i + i) (i?-i) 2 y(l-y)0' + l) Mi-y Mi-z/ U + 1) G"i - y) 5 
Substituting, we get

(i - my y^f s \ Dj i Mi -mi) 

(! - & V wCj + - (i + 1) (mi - y) 2 ' 

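Substituting in (7),

$$Sum(0, \lfloor yj \rfloor - 1) \le \Theta\left(\frac{e^{-Dj}}{(j+1)(\mu_1 - y)^2}\right) + \Theta(1) \sum_{s=0}^{\lfloor yj \rfloor - 1} f_{j,\mu_1}(s) \le \Theta\left(e^{-Dj} \frac{1}{(j+1)} \frac{1}{\Delta'^2}\right) + \Theta\left(e^{-2(\mu_1 - y)^2 j}\right). \qquad (8)$$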



Bounding Sum([yj\, [yj\). We use < +J+^gy = (l - R s {^0r, to get 

s«m(LwJ,LwJ) WLwJ) 



< 



< (1-^ + ^) ^,(1-^ 



i - y (i - y) J 

< 3e~ Dj . (9) 



The last inequality uses j > > jz^j- 



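Bounding $Sum(\lceil yj \rceil, \lfloor \mu_1 j - \frac{\Delta'}{2} j \rfloor)$. Now, if $j \ge \frac{8}{\Delta'}$, then $\sqrt{(j+1)y(1-y)} > \sqrt{y} > y$, so $y(j+1) - \sqrt{(j+1)y(1-y)} < yj \le \lceil yj \rceil$. Therefore, for $s \ge \lceil yj \rceil$, $F_{j+1,y}(s) = \Theta(1)$. Using this observation, we derive the following.

$$Sum\left(\lceil yj \rceil, \left\lfloor \mu_1 j - \tfrac{\Delta'}{2} j \right\rfloor\right) = \sum_{s=\lceil yj \rceil}^{\lfloor \mu_1 j - \frac{\Delta'}{2} j \rfloor} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)} = \Theta\left(\sum_{s=\lceil yj \rceil}^{\lfloor \mu_1 j - \frac{\Delta'}{2} j \rfloor} f_{j,\mu_1}(s)\right) \le \Theta\left(e^{-\Delta'^2 j/2}\right), \qquad (10)$$

where the inequality follows using Chernoff-Hoeffding bounds (refer to Fact 2).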

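Bounding $Sum(\lceil \mu_1 j - \frac{\Delta'}{2} j \rceil, j)$. For $s \ge \lceil \mu_1 j - \frac{\Delta'}{2} j \rceil = \lceil yj + \frac{\Delta'}{2} j \rceil$, again using Chernoff-Hoeffding bounds from Fact 2,

$$F_{j+1,y}(s) \ge 1 - e^{-2\left(yj + \frac{\Delta'}{2} j - y(j+1)\right)^2/(j+1)} \ge 1 - e^{2\Delta'} e^{-\Delta'^2 j/2} \ge 1 - e^{\Delta'^2 j/4} e^{-\Delta'^2 j/2} = 1 - e^{-\Delta'^2 j/4}.$$

The last inequality uses $j \ge \frac{8}{\Delta'}$. Therefore,

$$Sum\left(\left\lceil \mu_1 j - \tfrac{\Delta'}{2} j \right\rceil, j\right) = \sum_{s=\lceil \mu_1 j - \frac{\Delta'}{2} j \rceil}^{j} \frac{f_{j,\mu_1}(s)}{F_{j+1,y}(s)} \le \frac{1}{1 - e^{-\Delta'^2 j/4}} = 1 + \frac{1}{e^{\Delta'^2 j/4} - 1}.$$

Combining, we get for $j \ge \frac{8}{\Delta'}$,

$$E\left[\frac{1}{p_{i,\tau_j+1}}\right] \le 1 + \Theta\left(e^{-\Delta'^2 j/2} + \frac{1}{(j+1)\Delta'^2} e^{-Dj} + \frac{1}{e^{\Delta'^2 j/4} - 1}\right).$$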



